arXiv:1507.00874v3 [stat.CO] 15 Dec 2015 


Adapting the ABC distance function 


Dennis Prangle* 

Abstract 

Approximate Bayesian computation performs approximate inference for models 
where likelihood computations are expensive or impossible. Instead simulations from 
the model are performed for various parameter values and accepted if they are close 
enough to the observations. There has been much progress on deciding which summary 
statistics of the data should be used to judge closeness, but less work on how to weight 
them. Typically weights are chosen at the start of the algorithm which normalise the 
summary statistics to vary on similar scales. However these may not be appropriate in 
iterative ABC algorithms, where the distribution from which the parameters are pro¬ 
posed is updated. This can substantially alter the resulting distribution of summary 
statistics, so that different weights are needed for normalisation. This paper presents 
two iterative ABC algorithms which adaptively update their weights and demonstrates 
improved results on test applications. 

Keywords: likelihood-free inference, population Monte Carlo, quantile distributions, Lotka- 
Volterra 


1 Introduction 

Approximate Bayesian computation (ABC) is a family of approximate inference methods 

which can be used when the likelihood function is expensive or impossible to compute but 
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simulation from the model is straightforward. The simplest algorithm is a form of rejection 
sampling. Here parameter values are drawn from the prior distribution and corresponding 
datasets simulated. Each simulation is converted to a vector of summary statistics s = 
(si, S 2 ,..., Sm) and a distance between this and the summary statistics of the observed data, 
Sobs, is calculated. Parameters producing distances below some threshold are accepted and 
form a sample from an approximation to the posterior distribution. 

The choice of summary statistics has long been recognised as being crucial to the quality 


of the approximation (Beaumont et ah, 2002), but there has been less work on the role of 


the distance function. A popular distance function is weighted Euclidean distance: 
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where Uj is an estimate of the prior predictive standard deviation of the ith summary statistic. 
In ABC rejection sampling a convenient estimate is the empirical standard deviation of 
the simulated s* values. Scaling by Uj in ([^ normalises the summaries so that they vary 
over roughly the same scale, preventing the distance being dominated by the most variable 
summary. 

This paper concerns the choice of distance in more efficient iterative ABC algorithms, in 


particular those of Toni et ah (2009), Sisson et ah (2009) and Beaumont et ah (2009). The 


first iteration of these algorithms is the ABC rejection sampling algorithm outlined above. 
The sample of accepted parameters is used to construct an importance density. An ABC 
version of importance sampling is then performed. This is similar to ABC rejection sampling, 
except parameters are sampled from the importance density rather than the prior, and the 
output sample is weighted appropriately to take this change into account. The idea is to 
concentrate computational resources on performing simulations for parameter values likely 
to produce good matches. The output of this step is used to produce a new importance 
density and perform another iteration, and so on. In each iteration the acceptance threshold 
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is reduced, resulting in increasingly accurate approximations. Full details of the Toni et al. 


(2009) implementation are reviewed later. 

Weighted Euclidean distance is commonly used in these algorithms with cTj values de¬ 
termined in the hrst iteration. However there is no guarantee that these will normalise the 
summary statistics produced in later iterations, as these are no longer drawn from the prior 
predictive. This paper proposes two variant iterative ABC algorithms which update their cxj 
values to appropriate values at each iteration. It is demonstrated that these algorithms pro¬ 
vide substantial advantages in applications. Also, they do not require any extra simulations 
to be performed solely for tuning. Therefore even when a non-adaptive distance performs 
adequately, there is no major penalty in using the new approach. (Some additional calcula¬ 
tions are required - calculating more cxj values and more expensive distance calculations - 
but these form a negligible part of the overall computational cost.) 


One of the proposed algorithms has similarities to the iterative ABC methods of Sedki 


et al. (2012) and Bonassi and West (2015). These postpone deciding some elements of the 


tuning of iteration t until during that iteration. Algorithm also uses this strategy but 
for different tuning decisions: the distance function and the acceptance threshold. Another 


related paper is Fasiolo and Wood (2015) which contains an illustration of the difficulty of 
choosing ABC distance weights non-adaptively. 

The remainder of the paper is structured as follows. Section reviews ABC algorithms. 
This includes some novel material on the convergence of iterative ABC methods. Section 
discusses weighting summary statistics in a particular ABC distance function. Section 
details the proposed algorithms. Several examples are given in Section Section sum¬ 
marises the work and discusses potential extensions. Finally Appendix contains technical 
material on convergence of ABC algorithms. Computer code to implement the methods 


of this paper in the Julia programming language (Bezanson et al., 2012) is available at 
https://github.com/dennisprangle/ABCDistances.jl, 
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2 Approximate Bayesian Computation 


This section sets out the necessary background on ABC algorithms. Several review papers 
(e.g. Beaumont, 2010 Csillery et ah, 2010 Marin et ah| 2012) give detailed descriptions of 


other aspects of ABC, including tuning choices and further algorithms. Sections 2.1 and 2.2 


review ABC versions of rejection sampling and PMC. Section 2.3 contains novel material on 
the convergence of ABC algorithms. 


2.1 ABC rejection sampling 

Consider Bayesian inference for parameter vector 6 under a model with density 7r{y\6). 
Let 71 (9) be the prior density and ^obs represent the observed data. It is assumed that 
77{y\6) cannot easily be evaluated but that it is straightforward to sample from the model. 
ABC rejection sampling (Algorithm exploits this to sample from an approximation to 
the posterior density 7r{6\y). It requires several tuning choices: number of simulations N, 
a threshold h > 0, a function S{y) mapping data to a vector of summary statistics, and a 
distance function 

Algorithm 1 ABC-rejection 

1. Sample 6* from 7t{6) independently for 1 < i < N. 

2. Sample y* from 7r{y\6*) independently for 1 < i < N. 

3. Calculate s* = S{y*) ior 1 <i < N. 

4. Calculate d* = d{s*,Sohs) (where Sobs = S{yohs)-) 

5. Return {9*\d* < h}. 


The threshold h may be specihed in advance. Alternatively it can be calculated following 
step 4. For example a common choice is to specify an integer k and take h to be the fcth 


smallest of the d* values (Biau et ah, 2015). 
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2.2 ABC-PMC 


Algorithm is an iterative ABC algorithm taken from Toni et ah (2009). Very similar 


algorithms were also proposed by Sisson et al. (2009) and Beanmont et ah (2009). The latter 


note that this approach is an ABC version of popnlation Monte Carlo (Cappe et al., 2004), so 


it is referred to here as ABC-PMC. The algorithm involves a seqnence of thresholds, {ht)t>i- 
Similarly to h in ABC-rejection, this can be specihed in advance or dnring the algorithm, as 
discnssed below. 

Algorithm 2 ABC-PMC (with the option of adaptive ht) 

Initialisation 


1. Let t = 1. 

Main loop 

2. Repeat following steps nntil there are N acceptances. 

(a) Sample 9* from importance density qt{9) given in eqnation (|^. 

(b) If 7i{9*) = 0 reject and retnrn to (a). 

(c) Sample y* from 7i{y\9*) and calcnlate s* = S{y*). 

(d) Accept if d(s*,Sobs) < ht. 

Denote the accepted parameters as 9\,...,9\^ and the corresponding distances as 
d\,..., d%. 

3. Let wj = tt{ 9\)/qt{9\) ioi 1 <i < N. 

4. (Optional) Let ht+i be the a qnantile of the d\ valnes. 

5. Increment t to t + 1. 

End of loop 


The algorithm samples parameters from the importance density 
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if t = 1, or t = 2 and hi = oo 
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In the first iteration (and sometimes the second, as discussed shortly) qt{0) is the prior. 


Otherwise (2b) is used, which effectively samples from the previous weighted population and 


perturbs the result using kernel Kt. Beaumont et ah (2009) show that a good choice of the 
latter is 

where (j) is the density of a normal distribution and is the empirical variance matrix of 
calculated using weights ^)i<j<Ar 

As mentioned above, the sequence of thresholds can be specihed in advance. However it 


is hard to do this well. A popular alternative (Drovandi and Pettitt, 2011a) is to choose the 


thresholds adaptively by setting ht at the end of iteration t — 1 to be the a quantile of the 
accepted distances (n.b. a < 1 is assumed throughout the paper). An optional step, step 
4, is included in Algorithm to implement this method. Alternative updating rules for ht 
have been proposed such as choosing it to reduce an estimate of effective sample size by a 


prespecihed proportion (Del Moral et ah, 2012) or using properties of the predicted ABC 


acceptance rate (Silk et al., 2013). 


If step 4 is used this leaves hi and a as tuning choices. A simple default for hi is oo, in 


which case all simulations are accepted when t = 1. In this case (2b) would give q2{0) as 


simply a modihed prior with inflated variance, which is not a sensible importance density. 
Therefore ([^ takes q2i0) = ti{9) in this case. This is a minor novelty of this presentation of 
the algorithm. 

A practical implementation of Algorithm requires a condition for when to terminate. 
In this paper the total number of datasets to simulate is specihed as a tuning parameter 
and the algorithm stops once a further simulation is required. Some alternative are possible, 
such as stopping once the algorithm falls below a target value for ht or the acceptance rate. 

Several variations on Algorithm have been proposed which are briehy discussed in 
Section]^ Some of these are ABC versions of sequential Monte Carlo (SMC). The phrase 
“iterative ABC” will be used to cover ABC-PMC and ABC-SMC. 
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2.3 Convergence of ABC-PMC 

Conditions C1-C5 ensure that Algorithm [^converges on the posterior density in an appropri¬ 
ate sense as the number of iterations tends to inhnity. This follows from Theorem which is 
described in Appendix Although only hnite computational budgets are available in prac¬ 
tice, such convergence at least guarantees that the target distribution become arbitrarily 
accurate as computational resources are increased. 

Cl. 6 G M”, s G for some m,n and these random variables have density 7r(6*, s) with 
respect to Lebesgue measure. 

C2. The sets At = {s|d(s, Sobs) < ht} are Lebesgue measurable. 

C3. 7r(sobs) > 0. 

C4. limt_,.oo \At\ =0 (where | ■ | represents Lebesgue measure.) 

C5. The sets At have bounded eccentricity. 

Bounded eccentricity is dehned in Appendix Roughly speaking, it requires that under 
any projection of At to a lower dimensional space the measure still converges to zero. 

Condition Cl is quite strong, ruling out discrete parameters and summary statistics, but 
makes proof of Theorem [T] straightforward. Condition C2 is a mild technical requirement. 
The other conditions provide insight into conditions required for convergence. Condition C3 
requires that it must be possible to simulate Sobs under the model. Condition C4 requires 
that the acceptance regions At shrink to zero measure. For most distance functions this 
corresponds to hmi_).oo ht = 0. It is possible for this to fail. Some examples encountered by 
the author in practice follow. One is when datasets close to Sobs cannot be produced under 
the model of interest. Alternatively, even if Sobs can occur under the model, the algorithm 
may converge on importance densities on 9 under which it is impossible. This corresponds 
to concentrating on the wrong mode of the ABC target distribution in an early iteration. 
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Finally, condition C5 prevents At converging to a set where some but not all summary 
statistics are perfectly matched. 

Conditions C4 and C5 can be used to check which distance functions are sensible to use 
in ABC-PMC, usually by investigating whether they hold when ht —)■ 0. For example it is 
straightforward to show this is the case when is a metric induced by a norm. 


3 Weighted Euclidean distance in ABC 


This paper concentrates on using weighted Euclidean distance in ABC. Section 3.1 discusses 


this distance and how to choose its weights. Section |3.2| illustrates its usefulness in a simple 
example. 


3.1 Definition and usage 

Consider the following distance: 


d{x,y) 
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If (Uj = 1 for all i, this is is Euclidean distance. Otherwise it is a form of weighted Euclidean 
distance. 

Many other distance functions can be used in ABC, as discussed in Section |2.3[ for 
example weighted Li distance d{x,y) = ~ l/*l- die author’s knowledge the 


only published comparison of distance functions is by McKinley et ah (2009), which found 


little difference between the alternatives. Owen et al. (2015) report the same conclusion but 
not the details. This Ending is also supported in unpublished work by the author of this 
paper. Given these empirical results this paper focuses on ([^ as it is a simple choice, but 
no claims are made for its optimality. Some further discussion on this is given in Section 
Summary statistics used in ABC may vary on substantially difierent scales. In the 













extreme case Euclidean distance will be dominated by the most variable. To avoid this, 
weighted Euclidean distance is generally used. This usually takes Ui = l/uj where cTj is an 
estimate of the scale of the ith summary statistic. (Using this choice in weighted Euclidean 
distance gives the distance function ([^ discussed in the introduction.) 


A popular choice (e.g. Beaumont et ah, 2002) of ai is the empirical standard deviation 


of the ith summary statistic under the prior predictive distribution. Csillery et al.| (2012) 
suggest using median absolute deviation (MAD) instead since it is more robust to large 
outliers. MAD is used throughout this paper. For many ABC algorithms these Uj values 
can be calculated without requiring any extra simulations. For example this can be done 
between steps 3 and 4 of ABC-rejection. ABC-PMC can be modihed similarly, resulting 
in Algorithm which also updates ht adaptively, (n.b. All of the ABC-PMC convergence 


discussion in Section 2.3 also applies to this modification.) 


3.2 Illustration 

As an illustration. Figure shows the difference between using Euclidean and weighted 
Euclidean distance with Ui = 1/ai within ABC-rejection. Here cxj is calculated using MAD. 
For both distances the acceptance threshold is tuned to accept half the simulations. In this 
example Euclidean distance mainly rejects simulations where Si is far from its observed value: 
it is dominated by this summary. Weighted Euclidean distance also rejects simulations where 
§2 is far from its observed value and is less stringent about Si. 

Which of these distances is preferable depends on the relationship between the summaries 
and the parameters. For example if si were the only informative summary, then Euclidean 
distance would preferable. In practice, this relationship may not be known. Weighted 
Euclidean distance is then a sensible choice as both summary statistics contribute to the 
acceptance decision. 

This heuristic argument supports the use of weighted Euclidean distance in ABC more 
generally. One particular case is when low dimensional informative summary statistics have 
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Algorithm 3 ABC-PMC with adaptive ht and •) 

Initialisation 

1. Let t = 1 and hi = oo. 

Main loop 

2. Repeat following steps until there are N acceptances. 

(a) Sample 9* from importance density qt{9) given in equation (|^. 

(b) If 7r(6**) = 0 reject and return to (a). 

(c) Sample y* from T:{y\9*) and calculate s* = S{y*). 

(d) Accept if d(s*, Sobs) < h* (if f = 1 always accept). 

3. lit = 1: 

(a) Calculate {ai,a 2 , ■ ■ ■), a vector of MADs for each summary statistic, calculated 
from all the simulations in step 2 (including those rejected). 

(b) Dehne d{-, •) as the distance ([^ using weights {ui)i<i<rn where ooi = l/uj. 

Denote the accepted parameters as 9\,...,9lf and the corresponding distances as 
d\,..., d%. 

4. Let w\ = tt{ 9\)/qt{9\) for 1 < f < A. 

5. Let ht be the a quantile of the d\ values. 

6. Increment f to f + 1. 

End of loop 
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Figure 1: An illustration of distance functions in ABC rejection sampling. The points show 
simulated summary statistics Si and S 2 . The observed summary statistics are taken to be 
(0, 0) (black cross). Acceptance regions are shown for two distance functions, Euclidean (red 
dashed circle) and weighted Euclidean with MAD reciprocals as weights (blue solid ellipse). 
These show the sets within which summaries are accepted. The acceptance thresholds have 
been tuned so that each region contains half the points. 


been selected, for example by the methods reviewed in Blum et al. (2013). In this situation 
all summaries are known to be informative and should contribute to the acceptance decision. 

Note that in Figure [T] the observed summaries Sobs he close to the centre of the set of 
simulations. When some observed summaries are hard to match by model simulations this 
is not the case. ABC distances could now be dominated by the summaries which are hardest 
to match. How to weight summaries in this situation is discussed in Section 


4 Methods: Iterative ABC with an adaptive distance 

The previous section discussed normalising ABC summary statistics using estimates of their 
scale under the prior predictive distribution. This prevents any summary statistic dominating 
the acceptance decision in ABC-rejection or the hrst iteration of Algorithm where the 
simulations are generated from the prior predictive. However in later iterations of Algorithm 
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the simulations may be generated from a very different distribution so that this scaling 
is no longer appropriate. This section presents two versions of ABC-PMC which avoid this 
problem by updating the distance function at each iteration. Normalisation is now based 
on the distribution of summary statistics generated in the previous (Algorithm]^ or current 
(Algorithm]^ iteration. The proposed algorithms are presented in Sections |4.1| and 4.2 


An approach along these lines has the danger that the summary statistic acceptance 
regions at each iteration no longer form a nested sequence of subsets converging on the 
point s = Sobs- To avoid this, the proposed algorithms only accept a simulated dataset at 
iteration t if it also meets the acceptance criteria of every previous iteration. This can be 
viewed as sometimes modifying the tth distance function to take into account information 


from previous iterations. Section 4.3 discusses convergence in more depth. 


4.1 First proposed algorithm 

Algorithm is a straightforward modihcation of Algorithm which updates its distance 
function at each iteration using scales derived from the previous iteration’s simulations. The 
hrst iteration accepts everything so no distance function is required. This acts as an initial 
tuning step. Note that scales are based on both accepted and rejected simulations from the 
previous iteration. This is because using just the accepted simulations would mean the scales 
are sometimes mainly determined by the previous acceptance rule, restricting the scope for 
adaptation. 

Storing all simulated s* vectors to calculate scale estimates in step 3 of Algorithm can 
be impractical. In practice storage is stopped after the hrst few thousand simulations, and 
scale estimation is done using this subset. Other tuning details of Algorithm |^~ the choice 
of perturbation kernel Kt and the rule to terminate the algorithm - are implemented as 
described earlier for ABC-PMC. 
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Algorithm 4 ABC-PMC with adaptive ht and ■) 

Initialisation 

1. Let t = 1 and hi = oo. 

Main loop 

2. Repeat following steps nntil there are N acceptances. 

(a) Sample 0* from importance density qt{d) given in eqnation (|^. 

(b) If niO*) = 0 reject and retnrn to (a). 

(c) Sample y* from 7r(^|6**) and calcnlate s* = S{y*). 

(d) If f = 1 accept. Otherwise accept if d^{s*, Sobs) < hi for all 2 < i < f. 

3. Calcnlate (crjjcr^,...), a vector of MADs for each snmmary statistic, calcnlated from 
all the simnlations in step 2 (inclnding those rejected). 

4. Dehne •) as the distance (|^ nsing weights {uji)i<i<rn where oji = l/cTj. 

Denote the accepted parameters as 9\,... ,9% and the corresponding distances nnder 
•) as d \~^^,..., 

5. Let wl = 7r{9j)/qt{9j) for 1 < i < A. 

6. Let ht_|_i be the a qnantile of the valnes. 

7. Increment f to f + 1. 

End of loop 
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4.2 Second proposed algorithm 

Algorithm normalises simulations in iteration t based on scales derived in the preceding 
iteration. This could be inappropriate if two consecutive iterations sometimes generate 
simulations from markedly different distributions. Algorithm addresses this problem. 

A naive approach would be to start iteration t by tuning using an additional set of 

simulations based on parameters drawn from the current importance distribution. However 
this imposes an additional cost. Instead the algorithm makes a single large set of simulations. 
These are hrst used to construct the fth distance function. Then the best N simulations are 
accepted and used to construct the next importance distribution. 

A complication is deciding how many simulations to make for this large set. There 
must be enough that N of them are accepted. However the distance function dehning 
the acceptance rule is not known until after the simulations are performed. The solution 
implemented is to continue simulating until M = [iV/a] simulations pass the acceptance 
rule of the previous iteration. Let A be the set of these simulations and B be the others. 
Next the new distance function is constructed (based on .4, U H) and the N with lowest 
distances (from A) are accepted. The tuning parameter a has a similar interpretation to the 
corresponding parameter in Algorithms and the acceptance threshold in iteration t is 
the a quantile of the realised distances from simulations in A. 

Using this approach means that, as well as adapting the distance function, another dif¬ 
ference with Algorithms and is that selection of ht is delayed from the end of iteration 
f — 1 to part-way through iteration t (and therefore hi does not need to be specihed as a 
tuning choice.) If desired, this novelty can be used without adapting the distance function. 
Such a variant of Algorithm was tried on the examples of this paper, but the results are 
omitted as performance is very similar to Algorithm 

Given the same importance density and acceptance rule, an iteration of Algorithm 
requires the same expected number of simulations as Algorithms and In this sense their 
costs are the same. In practice, the algorithms select their importance density and acceptance 
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rules differently so this comparison of their computational costs is limited. Section [^contains 
empirical comparisons in terms of the mean squared error for a given number of simulations. 


Algorithm 5 ABC-PMC with adaptive ht and d^{-, •) 

Initialisation 

1. Let t = 1. 

Main loop 

2. Repeat following steps until there are M = [iV/a] acceptances. 

(a) Sample 9* from importance density qt{9) given in equation (|^. 

(b) If 7r(6**) = 0 reject and return to (a). 

(c) Sample y* from T:{y\d*) and calculate s* = S{y*). 

(d) If t = 1 accept. Otherwise accept if d\s*, Sobs) < hi for all i < t. 

Denote the accepted parameters as 91,, 9\j and the corresponding summary vectors 
as s*,... ,s^. 

3. Calculate ...), a vector of MADs for each summary statistic, calculated from 

all the simulations in step 2 (including those rejected). 

4. Dehne d^{-, ■) as the distance (|^ using weights {ujj)i<i<m where ca* = l/u*. 

5. Calculate d* = d^{s*, Sobs) for 1 < i < M. 

6. Let hf be the Ath smallest d* value. 

7. Let (6**)i<j<jv be the 9* vectors with the smallest d* values (breaking ties randomly). 

8. Let wl = 7r{9j)/qt{9j) for 1 < i < A. 

9. Increment t to t + 1. 

End of loop 


The comments at the end of Section 4.1 on tuning details and storing s* vectors also 


apply to Algorithm 
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4.3 Convergence 


This section shows that conditions for the convergence of Algorithms and in practice 


are 


essentially those described in Section 2.3 for standard ABC-PMC plus one extra requirement: 


_ bounded above. 

^ min • 


In more detail, conditions ensuring convergence of Algorithms and can be taken 
from Theorem [T] in Appendix These are the same as those given for other ABC-PMC 


algorithms in Section 2.3 with the exception that the acceptance region At is now defined as 
Sobs) < hi for all i < t}. Two conditions behave differently under this change: C4 

and C5. 

Condition C4 states that lim 4 _,.oo \At\ = 0 i.e. Lebesgue measure tends to zero. The 
definition of At for Algorithms and ensures \At\ is decreasing in t. However it may not 
converge to zero. Reasons for this are the same as why condition C4 can fail for standard 


ABC-PMC, as described in Section 2.3 


Condition C5 is bounded eccentricity (defined in Appendix 0 of the At sets. Under 
distance ([^ this can easily be seen to correspond to e* having an upper bound. This is not 
guaranteed by Algorithms and but it can be imposed, for example by updating u\ to 
uj\ + 5 maxj after step 4 for some small 5 > 0. However this was not found to be necessary 
in any of the examples of this paper. 


5 Examples 

This section presents three examples comparing the proposed and existing ABC-PMC algo¬ 
rithms: a simple illustrative normal model, the gf-and-Zc distribution and the Lotka-Volterra 
model. 
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5.1 Normal distribution 


Suppose there is a single parameter 9 with prior distribution A^(0,100^). Let Si ~ N{9, 0.1^) 
and S 2 ~ iV(0,1^) independently. These are respectively informative and uninformative 
summary statistics. Let Sobs,i = Sobs ,2 = 0. 

Figures!^ and [^illustrate the behaviour of ABC-PMC for this example using Algorithms 
[^ (with adaptive choice of ht), [^ and For ease of comparison the algorithms use the 
same random seed, and the distance function and hrst threshold value hi for Algorithms [^ 
and [^ are specihed to be those produced in the hrst iteration of Algorithm The effect is 
similar to making a short preliminary run of ABC-rejection to make these tuning choices. 
All algorithms use N = 2000 and a = 1/2. (Empirical tests show that a ~ 1/2 minimises 
mean squared error for all algorithms in this and the following examples.) 

Under the prior predictive distribution the MAD for si is in the order of 100 while that 
for S 2 is in the order of 1. Therefore the hrst acceptance region in Figure [^ is a wide ellipse. 
Under Algorithm [^ the subsequent acceptance regions are smaller ellipses with the same 
shape and centre. The acceptance regions for Algorithms [^ and [^ are similar for the hrst few 
iterations. After this, enough has been learnt about 6 that the simulated summary statistics 
have a diherent distribution, with a reduced MAD for Si. Hence Si is given a larger weight, 
while the MAD and weight of S 2 remain roughly unchanged. Thus the acceptance regions 
change shape to become narrower ellipses, which results in a more accurate estimation of 6, 
as shown by the comparison of mean squared errors (MSEs) in Figure [^ Note that Algorithm 
adapts its weights more quickly than Algorithm [^ and hence achieves a smaller MSE. 


5.2 g-and-k distribution 


The g-and-k distribution is a popular test of ABC methods. It is dehned by its quantile 
function: 


A + B 


1 + c 


1 - exp{-gz{x)) 
1 + exp{—gz{x)) 


[l + z{xY]’^z{x), 


(4) 
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Algorithm 2 simulations 



*• 


Algorithm 4 simulations 



Algorithm 5 simulations 



j. 


Figure 2: An illustration of ABC-PMC for a simple normal model using Algorithms]^ (non- 
adaptive distance function), and (adaptive distance functions). Top row: simulated 
summary statistics (including rejections) Bottom row: acceptance regions (note different 
scale to top row). In both rows colour indicates the iteration of the algorithm. 



Figure 3: Mean squared error of the parameter for a simple normal example using Algorithms 
[2} ID and |D 
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where z{x) is the quantile function of the standard normal distribution. Following the liter¬ 
ature ( |Rayner and MacGillivray 2002), c = 0.8 is used throughout. This leaves {A,B,g,k) 
as unknown parameters. 

The g-and-k distribution does not have a closed form density function making likelihood- 
based inference difficult. However simulation is straightforward: sample x ~ Unif(0,1) and 
substitute into Q. The following example is taken from Drovandi and Pettitt (2011b). 
Suppose a dataset is 10,000 independent identically distributed draws from the g-and-k 
distribution and the summary statistics are a subset of the order statistics: those with 
indices (1250,2500,... ,8750). (As in Fearnhead and Prangle, 2012| a fast method is used 
to simulate these order statistics without sampling an entire dataset.) The parameters are 
taken to have independent Unif(0,10) priors. 

To use as observations, 100 datasets are simulated from the prior predictive distribution. 
Each is analysed using Algorithms and All analyses uses a total of 10® simulations 
and tuning parameters N = 1000 and a = 1/2. Table shows root mean squared errors for 
the output of the algorithms, averaged over all the observed datasets. These show that the 
adaptive algorithms, and are more accurate overall for every parameter, and perform 
very similarly to each other. 

A B g k 


Algorithm 

Algorithm 

Algorithm 


0.335 0.501 0.880 0.163 
0.083 0.371 0.532 0.126 
0.081 0.373 0.523 0.126 


Table 1: Root mean squared errors of each parameter in the g-and-k example, averaged over 
analyses of 100 simulated datasets. 


More detail is now given for a particular observed dataset, simulated under parameter 
values (3,1,1.5, 0.5). Figure j^shows the estimated MSE of each parameter for each iteration 
of the three algorithms. The adaptive algorithms, and performs better throughout for 
the g and k parameters. For this dataset all the algorithms perform similarly for the location 
and scale parameters A and R, which have smaller MSE values. Table demonstrates that 
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the main difference in the final estimated posteriors is that Algorithm]^ has higher variances 
for the g and k parameters. 



A 

B 

g 

k 

Algorithm 

Algorithm 

Algorithm 

3 

4 

5 

2.98 (0.012) 
2.98 (0.012) 
2.98 (0.012) 

0.98 (0.028) 
0.97 (0.025) 
0.98 (0.024) 

1.52 (0.086) 
1.56 (0.048) 
1.56 (0.046) 

0.50 (0.081) 
0.53 (0.035) 
0.53 (0.033) 


Table 2: Estimated marginal posterior means and standard deviations (in brackets) of each 
parameter in the g-and-k example, for analysis of a particular simulated dataset. The values 
are taken from the final iteration of each algorithm, (n.b. All the estimated posteriors are 
roughly normal.) 

Figure 1^ shows some of the distance function weights produced by the algorithms. Algo¬ 
rithm places low weights on the most extreme order statistics, as they are highly variable 
in the prior predictive distribution. This is because the prior places significant weight upon 
parameter values producing very heavy tails. However by the last iteration of Algorithms 
1^ and such parameter values have been ruled out. The algorithm therefore assigns larger 
weights which provide access to the informational content of these statistics. 

5.3 Lotka-Volterra model 

The Lotka-Volterra model describes two interacting populations. In its original ecological 
setting the populations represent predators and prey. However it is also a simple example of 
biochemical reaction dynamics of the kind studied in systems biology. This section concen¬ 
trates on a stochastic Markov jump process version of this model with state {Xi,X 2 ) G 
representing prey and predator population sizes. Three transitions are possible: 

(Xi, X 2 ) -)■ (Xi -h 1 , X 2 ) (prey growth) 

(Xi, X 2 ) -)■ (Xi - 1 , X 2 -h 1 ) (predation) 

(Xi,X 2 ) —)■ (Xi,X 2 — 1 ) (predator death) 

These have hazard rates 6 * 1 X 1 , 62 X 1 X 2 and 6 * 3 X 2 respectively. Simulation is straightforward 
by the Gillespie method. Following either a transition at time t, or initiation at t = 0, the 
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Figure 4: Mean squared error of each parameter from Algorithms and for the g-dud-k 
example. 



Figure 5: Summary statistic weights used in Algorithms!^ andfor the g-axid-k example, 
rescaled to sum to 1. 
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time to the next transition is exponentially distributed with rate equal to the sum of the 
hazard rates at time t. The type of the next transition has a multinomial distribution with 


probabilities proportional to the hazard rates. For more background see for example Owen 


et al. (2015), from which the following specihc inference problem is taken. 


The initial conditions are taken to be Xi = 50, X 2 = 100. A dataset is formed of 
observations at times 2,4, 6 ,..., 32. Both Xi and X 2 are observed plus independent A^(0, cr^) 
errors, where a is hxed at exp(2.3). The unknown parameters are taken to be log 6 * 1 , log 6*2 
and log 03 . These are given independent Unif(—6,2) priors. The vector of all 32 noisy 
observations is used as the ABC summary statistics. 

A single simulated dataset is analysed (shown in Figure]^) This is generated from the 
model with 61 = 1,62 = 0 . 005,03 = 0.6. ABC analysis is performed using Algorithms]^ 
and A total of 50, 000 simulations are used in each algorithm. The tuning parameters 
are N = 200 and a = 1/2. Any Lotka-Volterra simulation reaching 100,000 transitions 
is terminated and automatically rejected. This avoids extremely long simulations, such as 
exponential prey growth if predators die out. These incomplete simulations are excluded 
from the MAD calculations, but this should have little effect as they are rare. 

Figure shows the MSEs resulting from the analyses. The adaptive algorithms, and 
1^ have similar outputs. Both produce smaller errors than Algorithm for all parameters 
after roughly 10,000 simulations. Table demonstrates that the main difference in the hnal 
estimated posteriors is that Algorithm has higher variances. Figure shows the weights 
used throughout Algorithm and those used in the hnal iteration of the others. Again the 
adaptive algorithms are similar to each other but different to Algorithm]^ Figure [^explains 
this by showing a sample of simulated datasets on which these weights are based. Under 
the prior predictive distribution (shown in the top row), at least one population usually 
quickly becomes extinct, illustrating that the prior distribution concentrates on the wrong 
system dynamics and so is unsuitable for choosing distance weights for later iterations of the 
algorithm. 
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Algorithm 

Algorithm 

Algorithm 


3 

4 

5 


log 0 i 

-0.048 (0.15) 
- 0.021 ( 0 . 10 ) 
- 0.021 ( 0 . 10 ) 


log 6*2 

-5.15 (0.21) 
-5.24 (0.11) 
-5.24 (0.11) 


log 6*3 

-0.48 (0.22) 
-0.56 (0.13) 
-0.55 (0.12) 


Table 3: Estimated marginal posterior means and standard deviations (in brackets) of each 
parameter in the Lotka-Volterra example, for analysis of a particular simulated dataset. The 
values are taken from the hnal iteration of each algorithm. The true values are 0, —5.30 and 
—0.51. (n.b. All the estimated posteriors are roughly normal.) 





Figure 6 : Mean squared error of each parameter (i.e. log 0i, log 6 * 2 , log 03 ) from ABC-PMC 
output for the Lotka-Volterra example. 
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6 Discussion 


This paper has presented two ABC-PMC algorithms with adaptive distance functions. The 
algorithms adapt the structure to ABC-PMC by using the output of existing simulation steps 
to adapt their distance functions. Therefore they have a similar computational cost for the 
same number of iterations. Furthermore, their convergence properties are similar to ABC- 
PMC. Several examples have been shown where the new algorithms improve performance. 
This is because in each example the scale of the summary statistics varies signihcantly 
between prior and posterior predictive distributions. Of the two algorithms, Algorithm is 
simpler to implement, involving only a small modihcation to standard ABC-PMC, and has 
essentially the same performance to Algorithm in two of the three examples. Algorithm 
performs better in the example of Section suggesting it is preferable in situation where 
continual adaptation is required. The remainder of this section discusses possibilities to 
extend this work. 

Several variations on ABC-PMC have been proposed in the literature. The adaptive 
distance function idea introduced here can be used in most of these. This is particularly 


simple for ABC model choice algorithms (e.g. Toni et ah, 2009). Here, instead of proposing 
6 * values from an importance density, pairs are proposed, where m* is a model 

indicator. This could be implemented in Algorithms and while leaving the other details 


unchanged. Drovandi and Pettitt (2011a), Del Moral et ah (2012) and Lenormand et ah 


(2013) propose ABC-SMC algorithms which update the population of (6*, s) pairs between 
iterations in different ways to ABC-PMC. In all of these it seems possible to update distance 
functions using the strategies of Algorithms]^ andHowever some of these variations would 
require additional convergence results to those given in Appendix 

Several aspects of Algorithms and could be modihed. One natural alternative is to 
use Mahalanobis-style distance functions d^{x, y) = [(a; — y)'^W^{x — y)\ where is an 
estimate of the precision matrix. Scenarios exist in which this performs much better than 
weighted Euclidean distance, (j^ (Sisson, personal communication). However exploratory 
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work found it gave similar or worse performance for the examples in this paper. Distance 
(|^ is preferred here for this reason, and also because its weights are easier to interpret and 
there are more potential numerical difficulties in estimating a precision matrix. Nonetheless, 
for other problems it may be worth considering both alternatives. 

Another reason it may be desirable to modify the distance function ([^ is if some summary 
statistic, say Sj, has an observed value far from most simulated values. In this case |sobs,i —Sjl 
can be much larger than at, and so Sj can dominate the distances used in this paper. It is 
tempting to downweight Sj so that the others summaries can also contribute. Finding a good 
way to do this without ignoring Sj altogether is left for future work. 

Algorithms]^ and [^update the distance function at each iteration. There may be scope 
for similarly updating other tuning choices. It is particularly appealing to try to improve 


the choice of summary statistics as the algorithm progresses (as suggested by Barnes et al. 


2012 ) Summary statistics could be selected at the same time as the distance function based 


on the same simulations, for example by a modihcation of the regression method of Fearnhead 


and Prangle (2012). Further work would be required to ensure the convergence of such an 


algorithm. 
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A Convergence of ABC-PMC algorithms 

Algorithm is an ABC importance sampling algorithm. This appendix considers a sequence 
of these algorithms. Denote the acceptance threshold and distance function in the fth element 
of this sequence as ht and •). The ABC-PMC algorithms in this paper can be viewed as 
sequences of this form with specihc choices of how ht and d* are selected. Note ABC-rejection 
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is a special case of Algorithm with q{9) = vr(6*), so this framework can also investigate its 
convergence as h —0. 


Algorithm 6 ABC importance sampling 

1. Sample 6* from density q{6) independently for 1 < i < N. 

2. Sample y* from 'n'{y\6*) independently for 1 < i < N. 

3. Calculate s* = S{y*) for 1 < i < A. 

4. Calculate d* = d{s*,Sohs)- 

5. Calculate w* = 7i{9*)/q{9*) (where 7i{9) is the prior density) 

6. Return {{9*,w*)\d* < h}. 


The output of importance sampling is a weighted sample {9i, Wi) 


l<i<P 


for some value of P. 


A Monte Carlo estimate of A[/i(6')|sobs] for an arbitrary function h{-) is then por 


large P this asymptotically equals (as shown in Prangle, 2011 for example) the expectation 
under the following density: 


7rABC,t(6'|sobs) OC J 7r(s|6')7r(6')l[dt(s,Sobs) < ht]ds, 
known as the ABC posterior. 

Theorem 1. Under conditions C1-C5, \rmt^^TiABC,t{9\So})s) = 7r(6'|Sof,s) for almost every 
choice of {9, Sobs) (with respect to the density 7i{9, s)). 

The conditions are: 

Cl. 9 G M”, s G M”* for some m,n and these random variables have density 7t{9,s) with 
respect to Lebesgue measure. 

C2. The sets At = {s|(ii(s, Sobs) < ht} are Lebesgue measurable. 

C3. 7r(sobs) > 0. 
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C4. limt_^oo \At\ =0 (where | ■ | represents Lebesgue measure.) 

C5. The sets At have bounded eccentricity. 

The dehnition of bounded eccentricity is that for any At, there exists a set = {s | | |s — 
Sobsib such that At C Bt and \At\ > c\Bt\, where ||.|| denotes the Euclidean norm and 

c > 0 is a constant. 

Proof. Observe that: 


,. \ V j T^{0,s)l{s ^ At)ds 

hm ttabcIc' Sobs) = hm ^ a jo 

t^oo t-s-oo J 7r(6', s)l(s G AtjdsdO 

= hm 


hmt^^j^f^^^^7r(d,s)ds 
^ IsgA, 


lim 


71 


{d, Sobs) 
"^(^obs) 

= 7l{9\Sohs)- 


almost everywhere 


The fourth equality follows from the Lebesgue differentiation theorem, which requires 


conditions C4 and C5. For more details see Stein and Shakarchi (2009) for example. 
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Relative weight 



Figure 7: Summary statistic weights used in ABC-PMC for the Lotka-Volterra example, 
rescaled to sum to 1. 
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Figure 8: Observed dataset (black points) and samples of 20 simulated datasets (coloured 
lines) for the Lotka-Volterra example. The top row shows simulations from step 2 of the hrst 
iteration of Algorithmic The bottom row shows simulations from step 2 of the last iteration 
of Algorithm [C These are representative examples of the simulations used to select the 
weights shown in Figure [C Simulations for Algorithm |C are not shown but are qualitatively 
similar to the bottom row. 
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